ABSTRACT


The number of instances required for modelling grows exponentially with the
dimensionality of the feature space. In addition, higher dimensionality implies
greater computational complexity. Various modelling techniques face these problems.
It is therefore preferable to preprocess the available data set before passing the
samples or instances to a model. This preprocessing includes data prioritisation and
feature clustering.

Principal Component Analysis (PCA) and Independent Component Analysis
(ICA) are two commonly used data prioritisation techniques, which capture most of the
variation in the data with fewer dimensions. PCA yields uncorrelated components,
whereas ICA yields independent components.

Heterogeneity is always present in the data, which also causes problems in
modelling. Clustering or grouping of samples (generally based on features) is an answer
to such problems. K-means clustering partitions the total data set into k classes.
Sometimes, in K-means clustering, some classes may be empty. This problem is
addressed by fuzzy C-means clustering, in which each sample belongs to all
c classes with some membership. A class of neural networks also performs clustering
and classification. The Kohonen self-organising map acts similarly to the K-means
method, whereas the two unsupervised networks ART2 and Fuzzy ART classify the samples
according to a factor called the vigilance factor. The advantage of these networks is
that the number of classes need not be specified in advance. Inductive reasoning (ID3)
can also be applied for this purpose.

All these techniques have been applied to the steel converter lining life prediction
problem. Using PCA and ICA, the dimension of the system could be reduced from 15 x 26
to 15 x 13. These 15 x 13 systems are then given as input to various modelling
techniques. In general, ICA gives good results. In clusterwise modelling, K-means,
Fuzzy C-means and Kohonen clustering give better results.
CHAPTER 1


INTRODUCTION 

The number of instances required grows exponentially with the
dimensionality of the feature space. In addition, higher dimensionality implies
greater computational complexity. A variety of methods for dimensionality reduction
have been proposed, such as the following:

• Principal Component Analysis 

• Independent Component Analysis 

• Feature Clustering 

The first two approaches reduce dimensionality either by prioritising the
features or by forming linear combinations of them. The third approach merges
features which are highly correlated, since they provide redundant information.

This section presents a brief idea of Principal Component Analysis
(PCA), Independent Component Analysis (ICA) and various clustering algorithms such
as K-means clustering, Fuzzy C-means clustering, the Kohonen Self Organizing Map,
ART2 and Fuzzy ART, and of a classification tree method, ID3.

1.1 Brief review of techniques used 
(i) Principal Component Analysis 

Principal Component Analysis is concerned with the variance-covariance
structure of a multivariate system. It is the most widely used method for data reduction.
PCA is based on the eigenvectors of the covariance (or correlation) matrix of
the given data set. The associated eigenvalue represents the percentage loading of the
corresponding component on the system. Each PC is related to some features which are
dominant in the direction described by it. A threshold level for the cumulative loading of
components can be fixed according to the application. Once this threshold is reached,
no further components are considered. The most dominant feature is then selected for
each successive PC. These features are retained and the rest are deleted.



(ii) Independent Component Analysis 

Independent Component Analysis is an extension of
Principal Component Analysis. The actual dimensionality reduction (selection of the
number of components) is decided with PCA, and the ICs are used for prioritising the
features. PCA gives uncorrelated components, whereas ICA gives independent components.
A neural algorithm proposed in [6] is implemented here. The method used for feature
prioritisation is the same as that used in PCA.

(iii) K-means clustering 

The K-means clustering technique is used for grouping the total available
samples into k clusters. The value of k is specified before running the
algorithm. Clustering is generally performed using the similarity between patterns.
This method uses distance (Euclidean or any other) as the similarity measure. First,
tentative cluster centers are calculated (or specified) and the distance of a pattern
from each cluster center is calculated. The pattern belongs to the cluster whose
center is most similar to it, i.e. for which the distance is minimum. Once a new pattern
is added to a cluster, its center is shifted. The algorithm is repeated until no change
in classification takes place.

(iv) Fuzzy C-means Clustering 

The Fuzzy C-means method is an extension of K-means clustering. Here
each pattern belongs to all C clusters with some membership. The sum of these
membership values for a pattern is always 1. A pattern is said to be classified to the
cluster in which its membership is highest. Here also, Euclidean distance is used as the
similarity measure.

(v) Self Organizing Map (SOM)

The Kohonen network performs clustering through a competitive learning
mechanism called “winner takes all”. In essence, the node with the largest activation
level is declared the winner of the competition. This node is the only node not
suppressed to the zero activation level. Furthermore, this node and its neighbours are
the only nodes permitted to learn for the current input pattern. After training, the
weight vector of each node encodes the information of a group of similar input patterns.
An input vector is assigned to the node with the maximum activation. Since the number of
nodes is fixed, the net algorithm is similar to the K-means clustering algorithm.
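
The competitive step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nodes are arranged in a one-dimensional chain, the activation is taken to be the dot product of a node's weight vector with the input, and the names `som_step`, `lr` and `radius` are hypothetical and do not come from the thesis.

```python
import numpy as np

def som_step(weights, x, lr=0.5, radius=1):
    """One winner-takes-all update; weights has one row per node (modified in place)."""
    activation = weights @ x                     # assumed activation: dot product
    winner = int(np.argmax(activation))          # node with the largest activation wins
    lo = max(0, winner - radius)                 # neighbourhood on a 1-D chain (assumption)
    hi = min(len(weights), winner + radius + 1)
    weights[lo:hi] += lr * (x - weights[lo:hi])  # only the winner and its neighbours learn
    return winner

weights = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
x = np.array([0.9, 0.1])
winner = som_step(weights, x, lr=0.5, radius=0)  # radius=0: only the winner is updated
```

With `radius=0` only the winning node moves towards the input, so the update reduces to a sequential nearest-prototype rule, which is the sense in which the network resembles K-means.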

(vi) Adaptive resonance theory (ART2) 

ART2 is a widely used clustering technique for analog or continuous valued
patterns. The patterns are classified or clustered with an accuracy defined by a factor
called the “vigilance factor”. The ability of this network to generalize is limited;
however, its ability to create a new pattern class in its knowledge base on the arrival
of a novel pattern makes it very suitable for clustering. The classification depends on
the order of presentation of the input patterns.

(vii) Fuzzy ART 

Fuzzy ART can classify both binary and analog valued patterns. In this network
also, the clustering depends mainly on the vigilance factor, and the order of
presentation of the input patterns also plays a role in classification. This is also an
unsupervised network, and the number of clusters need not be specified: it is
determined by the network itself, depending upon the vigilance factor.

(viii) ID3 algorithm

The ID3 algorithm is a classification tree approach based on entropy
calculation. It can also be called an inductive learning approach. To use this method
for clustering, the input set first has to be modified: each feature is divided into
ranges. A classification tree is then built using a minimum entropy approach. At
every node, the feature having minimum entropy is selected and a decision is taken
there. Classification continues until all the final nodes are leaf nodes. All the
patterns belonging to the same leaf node belong to the same group.
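
The minimum entropy choice at a node can be illustrated with a small Python sketch. It assumes that every pattern carries a group label and that the features have already been divided into ranges. The feature names, the labels and the helper names `entropy` and `split_entropy` are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels):
    """Weighted entropy of the labels after splitting on one discretised feature."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        total += len(subset) / n * entropy(subset)
    return total

# Hypothetical discretised features and group labels for four patterns
features = {
    "temperature": ["lo", "lo", "hi", "hi"],
    "oxygen": ["lo", "hi", "lo", "hi"],
}
groups = ["A", "A", "B", "B"]

# The feature with minimum entropy after the split is chosen at this node
best = min(features, key=lambda name: split_entropy(features[name], groups))
```

In this toy case splitting on the first feature separates the two groups completely, so it has the lower entropy and is selected.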

1.2 Problem statement 

The objective is to select the important variables and to cluster the data samples
for modelling and prediction of steel making converter life using operating parameters.

The life of a steel making converter depends on various features. Here an
attempt has been made to categorize the variables and obtain a feature space of
reduced dimension. Then, based on the selected features, the samples have been
clustered to find the heterogeneity in the data.

1.3 Organization of thesis 

The data prioritisation techniques, Principal Component Analysis and
Independent Component Analysis, are discussed in chapter 2. Chapter 3 contains a detailed
description of all the clustering algorithms. The results of the respective techniques
are given along with their discussion. In chapter 4, a comparative study of all these
methods is given.



CHAPTER 2 


DATA PRIORITISATION TECHNIQUES 


Data prioritisation or feature prioritisation is an important aspect of
dimensionality reduction. Feature prioritisation is concerned with selecting the features
which capture most of the variation available in the full set of variables. Principal
Component Analysis (PCA) and Independent Component Analysis (ICA) can perform
this task. This chapter contains a detailed description of these two techniques.

2.1 Principal Component Analysis 

The central idea of principal component analysis (PCA) is to reduce the
dimensionality of a data set which consists of a large number of interrelated variables,
while retaining as much as possible of the variation present in the data set. This can be
achieved by two methods. The first is to transform the variables to a new set of variables
called the principal components (PCs), which are uncorrelated and are ordered so
that the first few retain most of the variation present in all of the original variables.
The second is to prioritise the original variables based on the obtained PCs and then
select the variables according to their priority. These selected variables cover most of
the information contained in the total data set.

So, although p components (p being the total number of variables) are required to
reproduce the total system variability, often much of this variability can be accounted
for by a small number k of the principal components. If so, there is (almost) as much
information in the k components as there is in the original p variables.

In the proposed work, the dimensionality reduction is carried out by prioritising the
variables.

Algebraically, principal components are linear combinations of the p random
variables $X_1, X_2, \ldots, X_p$. Geometrically, these linear combinations represent the
selection of a new co-ordinate system obtained by rotating the original system with
$X_1, X_2, \ldots, X_p$ as the co-ordinate axes. The new axes represent the directions with
maximum variability and provide a simpler and more parsimonious description of the
covariance structure.

Principal components depend solely on the covariance matrix $\Sigma$ (or the
correlation matrix $\rho$) of $X_1, X_2, \ldots, X_p$. Their development does not require a
multivariate normal assumption. On the other hand, principal components derived for
multivariate normal populations have useful interpretations in terms of constant density
ellipsoids. Further, inferences can be made from the sample components when the
population is multivariate normal.

Let the random vector $X' = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\Sigma$ with
eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

Consider the linear combinations

$$Y_1 = l_1'X = l_{11}X_1 + l_{21}X_2 + \cdots + l_{p1}X_p$$

$$Y_2 = l_2'X = l_{12}X_1 + l_{22}X_2 + \cdots + l_{p2}X_p$$

$$\vdots$$

$$Y_p = l_p'X = l_{1p}X_1 + l_{2p}X_2 + \cdots + l_{pp}X_p$$

The principal components are those uncorrelated linear combinations
$Y_1, Y_2, \ldots, Y_p$ whose variances are as large as possible.

The first principal component is the linear combination with maximum variance.
That is, it maximizes $\mathrm{Var}(Y_1) = l_1'\Sigma l_1$. It is clear that $\mathrm{Var}(Y_1) = l_1'\Sigma l_1$ can be
increased by multiplying any $l_1$ by some constant. To eliminate this indeterminacy, it is
convenient to restrict attention to coefficient vectors of unit length. We therefore define

First principal component = linear combination $l_1'X$ that maximizes
$\mathrm{Var}(l_1'X)$ subject to $l_1'l_1 = 1$

Second principal component = linear combination $l_2'X$ that maximizes
$\mathrm{Var}(l_2'X)$ subject to $l_2'l_2 = 1$ and $\mathrm{Cov}(l_1'X, l_2'X) = 0$

i-th principal component = linear combination $l_i'X$ that maximizes
$\mathrm{Var}(l_i'X)$ subject to $l_i'l_i = 1$ and $\mathrm{Cov}(l_i'X, l_k'X) = 0$ for $k < i$


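Principal components can be evaluated by performing an eigenvalue-eigenvector
analysis on the covariance matrix $\Sigma$ (or the correlation matrix $\rho$ when the variables
are measured in different units). Let $\Sigma$ have the eigenvalue-eigenvector pairs
$(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. The i-th principal
component is given by

$$Y_i = e_i'X = e_{1i}X_1 + e_{2i}X_2 + \cdots + e_{pi}X_p, \quad i = 1, 2, \ldots, p$$

With these choices,

$$\mathrm{Var}(Y_i) = e_i'\Sigma e_i = \lambda_i, \quad i = 1, 2, \ldots, p$$

$$\mathrm{Cov}(Y_i, Y_k) = e_i'\Sigma e_k = 0, \quad i \ne k$$

If some $\lambda_i$ are equal, the choices of the corresponding coefficient vectors $e_i$, and
hence $Y_i$, are not unique.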

Therefore, the principal components are uncorrelated and have variances equal
to the eigenvalues of $\Sigma$. Hence

total variance = $\lambda_1 + \lambda_2 + \cdots + \lambda_p$

Consequently, the proportion of the total population variance due to the k-th principal
component is

$$\frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}, \quad k = 1, 2, \ldots, p$$


If most (for instance, 80 to 90%) of the total population variance, for large p,
can be attributed to the first one, two or three components, then these components can
replace the original p variables without much loss of information.

Each component of the coefficient vector $e_i' = [e_{1i}, e_{2i}, \ldots, e_{pi}]$ also merits
inspection. The magnitude of $e_{ki}$ measures the importance of the k-th variable to the
i-th principal component. In particular, $e_{ki}$ is proportional to the correlation
coefficient between $Y_i$ and $X_k$.
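
The eigenvalue-eigenvector computation above can be sketched with NumPy. This is a minimal illustration on synthetic data, not the code used in the thesis; the function name `principal_components` and the random test matrix are assumptions made for the example.

```python
import numpy as np

def principal_components(X, use_correlation=True):
    """Eigenvalues (descending), eigenvectors (as columns) and the proportion
    of total variance due to each principal component."""
    X = np.asarray(X, dtype=float)
    # Correlation matrix when variables have different units, covariance otherwise
    S = np.corrcoef(X, rowvar=False) if use_correlation else np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]          # reorder so that lambda_1 >= ... >= lambda_p
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    proportion = eigvals / eigvals.sum()       # lambda_k / (lambda_1 + ... + lambda_p)
    return eigvals, eigvecs, proportion

# Synthetic data: 200 samples, 5 variables, two of them strongly related
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

eigvals, eigvecs, proportion = principal_components(X)
cumulative = np.cumsum(proportion)             # cumulative loading of the components
```

Column i of `eigvecs` plays the role of the coefficient vector $e_i$, and `cumulative` is the quantity compared against a threshold such as 80 to 90%.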

In the proposed work, the variables have been prioritised and, depending upon the
sample variances, a small number of variables have been selected for modelling purposes.
The results are then compared with the variables selected by the R&D personnel.

2.1.1 Selecting a subset of Variables 

When p, the number of observed variables, is large, it is often the case that a subset
of m variables, with m<p, will contain virtually all the information available in all p
variables. It would be useful to determine an appropriate value of m, and to determine
which subset of m variables is best.

The solution of these two problems, the choice of m and the selection of a good
subset, depends on the purpose to which the subset of variables is to be put. If the
purpose is simply to preserve most of the variation (as in our problem), then the PCs can
be used in a fairly straightforward way to solve both problems.

Regarding the choice of m, the following methods have been proposed:

(i) The total sample variation with m variables is calculated, which is the measure of
the variation contained in m variables. If it is about 80-90%, then this small
number of variables can be used.

(ii) PCA is performed on the correlation matrix $\rho$ rather than the covariance matrix
$\Sigma$. Then those components are chosen for which $\lambda \ge 1$.







Moving on to the choice of the m variables, the following method is used.
Associate one variable with each of the first m PCs, namely the variable not already
chosen with the highest coefficient (value of $e_{ki}$), in absolute value, in each
successive PC. These m variables are retained, and the remaining m* = p-m are deleted.
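
The two choices described in this section, the number m of components and the variable associated with each component, can be sketched as follows. The function names `choose_m` and `select_variables`, the 85% threshold and the small example matrices are assumptions made for illustration only.

```python
import numpy as np

def choose_m(eigvals, threshold=0.85):
    """Smallest m whose cumulative share of the total variance reaches the threshold."""
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    m = int(np.searchsorted(cumulative, threshold)) + 1
    return min(m, len(eigvals))

def select_variables(eigvecs, m):
    """For each of the first m components, pick the variable not already chosen
    with the largest absolute coefficient."""
    p = eigvecs.shape[0]
    available = np.ones(p, dtype=bool)
    chosen = []
    for i in range(m):
        coeffs = np.where(available, np.abs(eigvecs[:, i]), -np.inf)
        j = int(np.argmax(coeffs))
        chosen.append(j)
        available[j] = False               # a variable can be selected only once
    return chosen

# Illustrative eigenvalues and coefficient vectors (columns are components)
eigvals = np.array([2.0, 0.7, 0.3])
E = np.array([[0.9, 0.8, 0.10],
              [0.4, 0.5, 0.20],
              [0.1, 0.3, 0.97]])

m = choose_m(eigvals, threshold=0.85)
selected = select_variables(E, m)
```

Here the first variable dominates both of the first two components, so the second component falls back to the next largest coefficient, which is the "not already chosen" rule in action.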

2.1.2 Result of PCA 

PCA has been performed on the data sets of all 15 campaigns, but the results of only
two campaigns (1\2 and 1\4) are presented, as all the results are similar. As given
in table 2.1.2 for campaign no. 1\2, the cumulative loading for 14 components is 90.87%,
i.e. 14 components capture 90.87% of the total variation present in the data. Similarly, in
table 2.1.4 for campaign no. 1\4, the cumulative loading for 14 components is 90.72%.
Therefore, 14 components cover about 90% of the total information
present in the data sets. The variables are selected using the aforesaid criterion. With
reference to table 2.1.1, the following variables are chosen for campaign no. 1\2:

1. % CaO in slag

2. % P in Hot metal 

3. Hot metal weight 

4. Average Lime addition 

5. Tap-Tap time 

6. Tapping Temperature 

7. Ore addition 

8. Hot metal Temperature 

9. Bath Carbon 

10. Scrap weight 

11. Hot metal Carbon 

12. Dolo addition 

13. % Si in hot metal 

14. % FeO in slag 

For campaign 1\4, the following variables are chosen using table 2.1.3:

1. % CaO in slag

2. Hot metal weight




[Table 2.1.4: Component loading for campaign no. 1\4]

3. Bath Phosphorus 

4. Bath Sulphur 

5. Mn in Hot metal

6. Tap-tap time 

7. Average Lime addition 

8. %Si in Hot metal 

9. Blow Oxygen 

10. Ore addition 

11. Lance Ht. 

12. Tapping Temperature 

13. Slag Basicity 

14. %FeO in slag 

Similarly, for all campaigns PCA has been performed and common variables are 
finally selected. The list of selected variables is given in table 2.1.5. 

2.2 Independent Component Analysis 

Independent component analysis (ICA) is an extension of principal component
analysis (PCA) that has been developed in the context of blind separation of independent
sources from their linear mixtures. In a sense, the starting point of ICA is the
uncorrelatedness property of standard PCA. Roughly speaking, rather than requiring that
the coefficients of a linear expansion of the data vectors be uncorrelated, ICA requires
that they be mutually independent (or as independent as possible).

In the proposed work, the neural network implementation of ICA suggested in [5] is
used.

The ICA network structure

Let $x_k = [x_k(1), x_k(2), \ldots, x_k(L)]^T$ denote the L-dimensional k-th data vector,
where k = 1, 2, ..., L = no. of features or variables, and M = no. of independent
components.

Suppose we write the signal model

$$x_k = A s_k + n_k = \sum_{i=1}^{M} s_k(i)\, a(i) + n_k$$

Here $s_k = [s_k(1), s_k(2), \ldots, s_k(M)]^T$ is the source vector consisting of M source
signals (independent components) $s_k(i)$ (i = 1, 2, ..., M) at the index value k,
$A = [a(1), a(2), \ldots, a(M)]$ is a constant L x M “mixing matrix” whose columns a(i) are
the basis vectors of ICA, and $n_k$ is possible corrupting additive noise.

Now, let us define the estimated expansion by

$$x_k = Q y_k + n'_k$$

where $y_k$ is the estimate of the original independent signal $s_k$, or suppose

$$y_k = B_k x_k$$

where $B_k$ is an M x L matrix called the separating matrix.

Here, the L x M matrix Q denotes the estimate of the ICA basis matrix A, $y_k$ is the
estimate of the independent component vector $s_k$, and $n'_k$ is the noise or error term.
The first task is separation of the sources, i.e. estimation of the vector $y_k$. This can
be done by learning the separating matrix B using some suitable algorithm. After this, the
components of the vector $y_k$ should be as independent as possible. For learning the
matrix Q, we then simply minimise the mean square error
$E\{\|n'_k\|^2\} = E\{\|x_k - Q y_k\|^2\}$ with respect to Q.

This estimation procedure can be realised using the two layer feed forward 
network. 

2.2.1 Algorithm for ICA 

The algorithm for ICA consists of three steps: 

(1) Whitening of the data set. 

(2) Separating algorithm. 

(3) Estimation of the Basis Vectors of ICA

Whitening:

Prior to inputting the data vectors $x_k$ to the ICA network, they are made zero
mean by subtracting the mean, if necessary. This normalises the data with respect to the
first order statistics. Next, the data vectors are made uncorrelated using Principal
Component Analysis (PCA). PCA is often used for whitening because the dimension can be
reduced at the same time. The PCA whitening matrix is given by

$$V = D^{-1/2} E^T$$

Here the M x M diagonal matrix $D = \mathrm{diag}[\lambda(1), \lambda(2), \ldots, \lambda(M)]$ and the L x M
matrix $E = [c(1), c(2), \ldots, c(M)]$, with $\lambda(i)$ denoting the i-th largest eigenvalue of
the data correlation matrix and c(i) the respective i-th principal eigenvector.

The whitened vectors are then given by

$$v_k = V x_k$$
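
The whitening step can be sketched with NumPy as below. This is an illustration on synthetic data rather than the thesis code. The function name `pca_whiten` is hypothetical, and the correlation matrix of the zero-mean data is estimated here as the average outer product of the centred samples, which is an assumption about the estimator.

```python
import numpy as np

def pca_whiten(X, M):
    """Centre the rows of X (samples) and whiten them with V = D^(-1/2) E^T."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # zero mean (first order statistics)
    C = Xc.T @ Xc / Xc.shape[0]                # L x L correlation matrix of centred data
    lam, vec = np.linalg.eigh(C)               # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:M]            # keep the M largest
    D = lam[idx]                               # diagonal of D
    E = vec[:, idx]                            # L x M matrix of principal eigenvectors
    V = np.diag(D ** -0.5) @ E.T               # M x L whitening matrix
    return Xc @ V.T, V, E, D                   # row k of the first output is v_k

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data
Z, V, E, D = pca_whiten(X, M=3)
cov_Z = Z.T @ Z / Z.shape[0]                   # close to the 3 x 3 identity
```

Choosing M smaller than L performs the dimension reduction and the whitening in one step, as noted in the text.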

Separating Algorithm

The core part of ICA is learning the separating matrix B, which can be written
as

$$B = W^T V$$

where W is the orthogonal M x M separating matrix applied to the whitened vectors $v_k$.
Learning rule:

$$W_{k+1} = W_k + \mu_k \left[ v_k - W_k\, g(y_k) \right] g(y_k^T)$$

where

$y_k = W_k^T v_k$
$\mu_k$ = learning rate
$g(t) = \tanh(t)$
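
A single step of a learning rule of this kind can be written compactly with NumPy. The sketch assumes the update has the form $W_{k+1} = W_k + \mu_k [v_k - W_k g(y_k)] g(y_k^T)$ with $y_k = W_k^T v_k$ and $g$ applied element by element. The function name `separating_step` and the starting values are hypothetical.

```python
import numpy as np

def separating_step(W, v, mu):
    """One update of the M x M matrix W for a single whitened vector v."""
    y = W.T @ v                     # current output y_k = W_k^T v_k
    g = np.tanh(y)                  # nonlinearity g(t) = tanh(t), element-wise
    return W + mu * np.outer(v - W @ g, g)

W = np.eye(2)                       # illustrative starting point
v = np.array([1.0, 0.0])            # one whitened sample
W_next = separating_step(W, v, mu=0.1)
```

In practice the step would be repeated over all whitened vectors, typically with a small or decreasing learning rate; this sketch makes no claim about convergence.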

Estimation of the Basis Vectors of ICA

After calculating the separating matrix B, the basis vectors can be calculated
easily using the following relation:

$$\hat{A} = B^T (B B^T)^{-1} = E D^{1/2} W$$

After estimating these basis vectors, we can use them for selecting the variables
as in PCA, i.e. associate one variable with each of the basis vectors, namely the variable
not already chosen with the highest coefficient, in absolute value, in each successive
basis vector. These M variables are retained and the remaining variables are deleted.
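
The relation between the two expressions for the basis vectors can be checked numerically. The sketch below builds arbitrary matrices with the stated properties (E with orthonormal columns, D diagonal and positive, W orthogonal); the random matrices are purely illustrative and are not thesis data.

```python
import numpy as np

rng = np.random.default_rng(1)
L, M = 6, 3

E, _ = np.linalg.qr(rng.normal(size=(L, M)))   # L x M, orthonormal columns
W, _ = np.linalg.qr(rng.normal(size=(M, M)))   # M x M orthogonal matrix
D = np.diag(rng.uniform(0.5, 2.0, size=M))     # positive diagonal matrix

V = np.diag(np.diag(D) ** -0.5) @ E.T          # whitening matrix V = D^(-1/2) E^T
B = W.T @ V                                    # separating matrix B = W^T V

A_pinv = B.T @ np.linalg.inv(B @ B.T)          # B^T (B B^T)^(-1)
A_closed = E @ np.sqrt(D) @ W                  # E D^(1/2) W
```

Both expressions give the same L x M matrix, so the basis vectors can be obtained from E, D and W without inverting anything.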

2.2.2 Results of ICA 

As with PCA, ICA has also been performed for all 15 campaigns, but the results of
only 2 campaigns (1\2 and 1\4) are presented here. From tables 2.1.2 and 2.1.4, we see
that 13 components cover almost 88% of the total information. These 13 components for
campaign no. 1\2 and 1\4 are given in tables 2.2.1 and 2.2.2 respectively. Each of these
13 components is dominated by some variables in its direction. The variable for
each component is selected using the aforesaid criterion. From table 2.2.1, the following
variables are chosen for campaign no. 1\2:

1. Hot metal temperature

2. Blow Oxygen 

3. Tap-Tap time 

4. Average Lime addition 

5. Ore addition 

6. Dolo addition 

7. Bath Carbon 

8. %MnO in slag 

9. %CaO in slag 

10. %MgO in slag 

11. %FeO in slag 

12. Basicity in slag 

13. Phosphorus in Hot metal 

Using table 2.2.2, the following variables are chosen for campaign no. 1\4:

1. Phosphorus in Hot metal

2. Sulphur in Hot metal 

3. Hot metal temperature 

4. Blow oxygen 

5. Tap-tap time 

6. Average Lime addition 

7. Ore addition 

8. Bath Carbon 

9. %MnO in slag 

10. %CaO in slag 
11. %MgO in slag


12. %FeO in slag 

13. Slag Basicity 

Similarly, for all campaigns ICA is performed and variables are selected. Final 
list of chosen variables for ICA is given in table 2.2.3 . 



CHAPTER 3 


CLUSTERING TECHNIQUES 


Group or Cluster analysis is a primitive technique in that no assumptions are 
made concerning number of groups or the group structure. Grouping is done on the 
basis of similarities or distances. 

The basic objective of cluster analysis is to discover natural groupings of the 
items (or variables). In turn, we must first develop a quantitative scale on which to 
measure the association (similarity) between objects. 

Most efforts to produce a rather simple group structure from a complex data set
necessarily require a measure of “closeness” or “similarity”. There is often a great deal
of subjectivity involved in the choice of a similarity measure. Important considerations
include the nature of the variables (discrete, continuous, binary), the scales of
measurement (nominal, ordinal, interval, ratio) and subject matter knowledge.

When items (units or cases) are clustered, proximity is usually indicated by some 
sort of distance. On the other hand, variables are usually grouped on the basis of 
correlation coefficients or like measure of association. 

As we know, the Euclidean distance between two p-dimensional
observations (items) $x = [x_1, x_2, \ldots, x_p]'$ and $y = [y_1, y_2, \ldots, y_p]'$ is given by

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_p - y_p)^2} = \sqrt{(x - y)'(x - y)}$$

The statistical distance between the same two observations is of the form

$$d(x, y) = \sqrt{(x - y)' A (x - y)}$$

Ordinarily, the entries of $A^{-1}$ are sample variances and covariances. However,
without prior knowledge of the distinct groups, these quantities cannot be computed.
For this reason, Euclidean distance is often preferred for clustering.



Another distance measure is the Minkowski metric

$$d(x, y) = \left[ \sum_{i=1}^{p} |x_i - y_i|^m \right]^{1/m}$$

For m=1, d(x, y) measures the “city-block” distance between two points in p
dimensions. For m=2, d(x, y) becomes the Euclidean distance. In general, varying m
changes the weight given to larger and smaller differences.

Whenever possible, it is advisable to use “true” distances, that is, distances
satisfying the following distance properties:

$d(x, y) = d(y, x)$

$d(x, y) > 0$ if $x \ne y$

$d(x, y) = 0$ if $x = y$

$d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality)
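
The distance measures discussed above are easy to compare side by side. The following sketch uses hypothetical helper names (`minkowski`, `statistical_distance`) and two illustrative points.

```python
import numpy as np

def minkowski(x, y, m=2):
    """Minkowski metric: m=1 gives city-block, m=2 gives Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** m) ** (1.0 / m))

def statistical_distance(x, y, A):
    """Distance of the form sqrt((x - y)' A (x - y)) for a given matrix A."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ A @ d))

x, y = [0.0, 0.0], [3.0, 4.0]
city_block = minkowski(x, y, m=1)                    # |3| + |4|
euclidean = minkowski(x, y, m=2)                     # sqrt(9 + 16)
with_identity = statistical_distance(x, y, np.eye(2))
```

With A equal to the identity matrix the statistical distance coincides with the Euclidean distance, which is the case used for clustering in this work.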

There are various clustering methods available; only a few are dealt with in this
thesis. Clustering is done using statistical methods as well as neural network based
algorithms.

Among statistical methods, the K-means clustering method is used, with Euclidean
distance as the similarity measure. Among neural algorithms, Kohonen's Self Organising
Map (SOM) and Adaptive Resonance Theory (ART) networks are used.

3.1 K-means clustering method 

The K-means method is a non-hierarchical clustering method designed to
group items, rather than variables, into a collection of K clusters. The number of
clusters K is specified in advance.

MacQueen suggests the term K-means for describing his algorithm, which assigns
each item to the cluster having the nearest centroid (mean). In its simplest version, the
process is composed of these three steps:

1. Partition the items into K initial clusters.

2. Proceed through the list of items, assigning an item to the cluster whose centroid
(mean) is nearest. (Distance is usually computed using Euclidean distance.) Recalculate
the centroid for the cluster receiving the new item and for the cluster losing the item.

3. Repeat Step 2 until no more reassignments take place.
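
The three steps can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the results below. The function name `kmeans_sequential`, the guard that stops a cluster from being emptied and the four sample points are assumptions made for the example.

```python
import numpy as np

def kmeans_sequential(X, initial_labels, max_passes=100):
    """Item-by-item K-means starting from an initial partition (Step 1)."""
    X = np.asarray(X, dtype=float)
    labels = np.array(initial_labels)
    K = int(labels.max()) + 1
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    for _ in range(max_passes):
        moved = False
        for i, x in enumerate(X):                       # Step 2: go through the items
            old = int(labels[i])
            new = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
            # Guard (assumption): never move the last member out of a cluster
            if new != old and np.sum(labels == old) > 1:
                labels[i] = new
                for k in (old, new):                    # update losing and receiving clusters
                    centroids[k] = X[labels == k].mean(axis=0)
                moved = True
        if not moved:                                   # Step 3: stop when nothing changes
            break
    return labels, centroids

X = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans_sequential(X, initial_labels=[0, 1, 0, 1])
```

Starting from a deliberately poor partition, the two items that were assigned to the wrong group are reallocated during the first pass, and the second pass confirms that no further reassignment takes place.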







Rather than starting with a partition of all items into K preliminary groups in
Step 1, we could specify K initial centroids, called seed points, and then proceed to
Step 2.

The final assignment of the items to clusters will be, to some extent,
dependent upon the initial partition or the initial selection of the seed points.
Experience suggests that most major changes in assignment occur with the first
reallocation step.

The greatest drawback of the K-means clustering method is the prior determination
of the number of clusters K. There are strong arguments for not fixing the number of
clusters, K, in advance. These include the following:

1. If two or more seed points inadvertently lie within a single cluster, their resulting
clusters will be poorly differentiated.

2. The existence of an outlier might produce at least one group with very disperse items.

3. Even if the population is known to consist of K groups, the sampling method may be
such that data from the rarest group do not appear in the sample. Forcing the data into K
groups would lead to nonsensical clusters.

In cases where a single run of the algorithm requires the user to specify K, it is
always a good idea to rerun the algorithm for several choices.

3.1.1 Results of K-means clustering 

For all the clustering techniques, only input features (variables) have been used 
for grouping of samples. However, for easy understanding of the results, actual no. of 
runs of converter has been used in preparation of tables. 

(i) for ICA data 

The K-means algorithm has been applied for 4 values of k. The results for
all 4 values of k are presented in table 3.1.1. For k=2, the converters with actual
numbers of runs 563, 499, 595 and 937 form one group, while the remaining ones form the
other group. An interesting point is that both the lowest and the highest numbers of runs
belong to the same group. This is the case for all 4 values of k, but the reason behind
it cannot be investigated here. For k=2, 772, 560, 539, 532, 662, 607, 712, 724, 746,
546 and 652 belong to the same group, but as the value of k is increased, the partition
is refined. For k=3, 772, 560, 662, 724, 746 and 652 form a different cluster. Another
interesting observation is that for k=3, 595 belongs to the group containing 539, 532
etc., but for k=4 and 5 it comes back to the group containing 499, 937 etc. For k=5, 746
forms a lone cluster. For k=2 and 3, 607 was not in the 772, 560 group, but for k=4 and
5 this sample belongs to the cluster containing these samples.

[Table 3.1.2: Results of k-means clustering for PCA data]

(ii) for PCA data 

For the PCA data also, the k-means algorithm has been applied for 4 values of k. The
results for all 4 values of k are presented in table 3.1.2. For k=2, 563, 499, 937, 595
and 546 belong to one group and the remaining ones form the other group. For k=3, 546 and
595 no longer belong to their old group and enter the group which contains 539, 532, 607
and 712. 563, 499 and 937 once again form the same cluster. The same is the case with
converter lives 772, 560, 662, 724, 746 and 652. An interesting observation here is that
for k=4 and 5, 595 comes back to its old cluster, i.e. the cluster containing 499, 937 and
563. For these two values of k, 772 forms a lone cluster, and 532, 539, 546 and 712 belong
to another cluster. For k=4, 560, 607, 662, 724, 746 and 652 belong to one group, but for
k=5, 652, 662, 724 and 746 form another cluster and only 560 and 607 are left in the
previous one.

(iii) for R&D data

The results of k-means clustering for the R&D data are presented in table 3.1.3.
This table reveals that although for k=2, 595 and 607 belong to the group having
converter lives 563, 499 and 937, for other values of k they are not in that group.
They themselves are always in one group. For k=3 and 4, 772, 560, 662, 724, 746 and
652 belong to the same group, but for k=5, 772 and 560 form another cluster. For k=4 and
5, 712, 539 and 546 belong to the same group.

3.2 Fuzzy C-means Clustering 

Bezdek developed an extremely powerful classification method to accommodate
fuzzy data. It is an extension of a method known as C-means (or K-means) clustering
when employed in a crisp classification sense. To introduce this method, we define a
sample set of n data samples that we wish to classify:

$$X = \{x_1, x_2, \ldots, x_n\}$$

Each data sample $x_i$ is defined by m features, i.e.

$$x_i = \{x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}\}$$

where each $x_i$ in the universe X is an m-dimensional vector of m elements or m features.
Since the m features can have different units, in general we have to normalize each of
the features to a unified scale before classification. In a geometric sense, each $x_i$ is
a point in m-dimensional feature space, and the universe of the data samples, X, is a
point set with n elements in the sample space.

In the Fuzzy C-means (FCM) clustering method, we define a family of fuzzy sets
$\{A_i, i = 1, 2, \ldots, c\}$ as a fuzzy c-partition on a universe of data points, X. Because
fuzzy sets allow for degrees of membership, we can extend the crisp classification idea
into a fuzzy classification notion. Then we can assign membership in more than one
class. It will be useful to describe the membership value that the k-th data point has in
the i-th class with the following notation:

$$\mu_{ik} = \mu_{A_i}(x_k) \in [0, 1]$$

with the restriction (as with crisp classification) that the sum of all membership values
for a single data point in all of the classes has to be unity:

$$\sum_{i=1}^{c} \mu_{ik} = 1 \quad \text{for all } k = 1, 2, \ldots, n$$

As in crisp classification, there can be no empty classes and there can be no class that
contains all data points. This qualification is manifested in the following expression:

$$0 < \sum_{k=1}^{n} \mu_{ik} < n$$

Each data point can have partial membership in more than one class, i.e.

$$\mu_{ik} \wedge \mu_{jk} \ne 0$$

and

$$\bigvee_{i=1}^{c} \mu_{A_i}(x_k) = 1 \quad \text{for all } k$$

$$0 < \sum_{k=1}^{n} \mu_{A_i}(x_k) < n \quad \text{for all } i$$

We now define a family of fuzzy partition matrices, $M_{fc}$, for the classification
involving c classes and n data points,

$$M_{fc} = \left\{ U \;\middle|\; \mu_{ik} \in [0, 1];\ \sum_{i=1}^{c} \mu_{ik} = 1;\ 0 < \sum_{k=1}^{n} \mu_{ik} < n \right\}$$

where i = 1, 2, 3, ..., c and k = 1, 2, 3, ..., n.

Any $U \in M_{fc}$ is a fuzzy partition, and it follows from the overlapping character
of the classes and the infinite number of membership values possible for describing class
membership that the cardinality of $M_{fc}$ is also infinity.

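To describe a method to determine the fuzzy c-partition matrix U for grouping a
collection of n data points into c classes, define an objective function $J_m$ for a fuzzy
c-partition,

$$J_m(U, v) = \sum_{k=1}^{n} \sum_{i=1}^{c} (\mu_{ik})^{m'} (d_{ik})^2$$

where

$$d_{ik} = d(x_k - v_i) = \left[ \sum_{j=1}^{m} (x_{kj} - v_{ij})^2 \right]^{1/2}$$

and where $\mu_{ik}$ is the membership of the k-th data point in the i-th class.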
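
As an illustration, the objective function can be evaluated directly for a given partition and set of centers. This sketch assumes the usual form of Bezdek's objective, the sum over all points k and classes i of the membership raised to the power m' times the squared Euclidean distance between point k and center i. The names `objective`, `U`, `X`, `V` and `m_prime` are chosen here for illustration only.

```python
import math
from typing import List


def objective(U: List[List[float]], X: List[List[float]],
              V: List[List[float]], m_prime: float = 2.0) -> float:
    """Sum of (membership ** m') * (squared Euclidean distance) over all pairs.

    U is c x n (memberships), X holds n data points, V holds c cluster centers.
    """
    total = 0.0
    for i, center in enumerate(V):
        for k, point in enumerate(X):
            d_ik = math.dist(point, center)  # Euclidean distance d_ik
            total += (U[i][k] ** m_prime) * d_ik ** 2
    return total
```

For two one-dimensional points 0 and 2 with centers placed exactly on them, a crisp assignment gives a value of 0, while equal memberships of 0.5 give 2.0 for m' = 2; the smaller value corresponds to the better clustering.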

The function $J_m$ can take a large number of values, the smallest one associated
with the best clustering. Because the number of possible values is infinite, owing to
the cardinality of fuzzy sets, we seek the best possible, or optimum,
solution without resorting to an exhaustive, or expensive, search. The distance measure
$d_{ik}$ is the Euclidean distance between the i-th cluster center and the k-th data point
(a point in m-space). The parameter m' is called the weighting parameter and has the range
$m' \in [1, \infty)$. This parameter controls the amount of fuzziness in the classification process. The
vector $v_i$ is the i-th cluster center, which is described by m features (m coordinates) and
can be arranged in vector form as $v_i = \{v_{i1}, v_{i2}, v_{i3}, \ldots, v_{im}\}$.

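Each of the cluster coordinates for each class can be calculated as follows:

$$v_{ij} = \frac{\sum_{k=1}^{n} \mu_{ik}^{m'} \, x_{kj}}{\sum_{k=1}^{n} \mu_{ik}^{m'}}$$

where j is a variable on the feature space, i.e., j = 1, 2, ..., m.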

The optimum fuzzy c-partition is the one that minimizes the function $J_m$ over all
partitions, i.e.

$$J_m^*(U^*, v^*) = \min_{M_{fc}} J(U, v)$$

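Here we seek the best solution available within a pre-specified level of accuracy.
The following algorithm was proposed by Bezdek for this purpose.

Algorithm

[i] Fix c (2 ≤ c < n) and select a value for the parameter m'. Initialize the partition
matrix $U^{(0)}$. Each step in this algorithm will be labeled r, where r = 0, 1, 2, ...

[ii] Calculate the c centers $\{v_i^{(r)}\}$ for each step.

[iii] Update the partition matrix for the r-th step, $U^{(r)}$, as follows:

$$\mu_{ik}^{(r+1)} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}^{(r)}}{d_{jk}^{(r)}} \right)^{2/(m'-1)} \right]^{-1} \quad \text{for } I_k = \emptyset$$

or

$$\mu_{ik}^{(r+1)} = 0 \quad \text{for all classes } i \text{ where } i \in \tilde{I}_k$$

where

$$I_k = \{ i \mid 2 \le c < n;\ d_{ik}^{(r)} = 0 \}$$

and

$$\tilde{I}_k = \{1, 2, \ldots, c\} - I_k$$

and

$$\sum_{i \in I_k} \mu_{ik}^{(r+1)} = 1$$

[iv] If $\| U^{(r+1)} - U^{(r)} \| \le \varepsilon_L$, stop; otherwise set r = r + 1 and return to step [ii].

In step [iv] we compare a matrix norm $\| \cdot \|$ of two successive fuzzy partitions to
a prescribed level of accuracy, $\varepsilon_L$, to determine whether the solution is good enough.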
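
The iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used for the results below: the initial partition is random, the matrix norm is taken as the largest absolute change in any membership, m' must be greater than 1, and a point that coincides with one or more centers shares its membership equally among those centers. The names `fcm`, `m_prime`, `eps`, `max_iter` and `seed` are illustrative.

```python
import math
import random
from typing import List, Tuple


def fcm(X: List[List[float]], c: int, m_prime: float = 2.0, eps: float = 1e-5,
        max_iter: int = 200, seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Fuzzy c-means sketch: returns (U, V) with U of size c x n."""
    n, dim = len(X), len(X[0])
    rng = random.Random(seed)
    # Step [i]: random initial partition, each column normalised to sum to one.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        s = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= s
    V: List[List[float]] = []
    for _ in range(max_iter):
        # Step [ii]: cluster centers as membership-weighted means.
        V = []
        for i in range(c):
            w = [U[i][k] ** m_prime for k in range(n)]
            total = sum(w)
            V.append([sum(w[k] * X[k][j] for k in range(n)) / total for j in range(dim)])
        # Step [iii]: update the memberships from the distances to the centers.
        U_new = [[0.0] * n for _ in range(c)]
        power = 2.0 / (m_prime - 1.0)
        for k in range(n):
            d = [math.dist(X[k], V[i]) for i in range(c)]
            zero = [i for i in range(c) if d[i] == 0.0]
            if zero:
                # Point sits exactly on one or more centers (assumed equal split).
                for i in zero:
                    U_new[i][k] = 1.0 / len(zero)
            else:
                for i in range(c):
                    U_new[i][k] = 1.0 / sum((d[i] / d[j]) ** power for j in range(c))
        # Step [iv]: stop when the largest change in any membership is small.
        change = max(abs(U_new[i][k] - U[i][k]) for i in range(c) for k in range(n))
        U = U_new
        if change <= eps:
            break
    # V holds the centers computed from the partition before the final update.
    return U, V
```

A crisp assignment, as used in the summary tables, follows by giving each data point to the class in which its membership is largest.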

3.2.1 Results of Fuzzy C-means clustering 

Fuzzy C-means (FCM) clustering has also been applied for four values of c,
namely 2, 3, 4 and 5. Tables 3.2.1 to 3.2.6 give the membership values of the data samples
for c = 4 and c = 5, and tables 3.2.7 to 3.2.9 contain only the number of the cluster to which
each sample belongs with maximum membership.

(i) for ICA data

The results of the fuzzy c-means clustering method are presented in tables 3.2.1, 3.2.4 and
3.2.7. In table 3.2.1 we have taken c = 4, and the membership with which a data sample
belongs to a particular group is given against the corresponding converter life. For c = 5,
similar results are presented in table 3.2.4. In table 3.2.7, we have awarded each sample to
the cluster to which it belongs with the highest membership (c varies from 2 to 5). From
table 3.2.1 it is clear that some data samples also belong to other groups with a
significant membership. Consider the sample for which the converter life is 539: it
belongs to cluster 1 with a membership of 0.6317 and to cluster 2 with a significant value
of 0.1976. But some samples clearly belong to a particular cluster with a very high
membership value, such as 560, which belongs to cluster 2 with a membership of 0.957.
For c = 5, the sample having a life of 595 runs belongs to cluster 1 with a membership of 0.527,
to cluster 4 with 0.1871 and to cluster 5 with 0.1427. This represents the fuzziness in the
available data.

[Table 3.2.5: Results of fuzzy C-means clustering for PCA data (c = 5)]

(ii) for PCA data 

The results for PCA data are presented in tables 3.2.2, 3.2.5 and 3.2.8. In tables
3.2.2 and 3.2.5, c is 4 and 5 respectively. The membership of a pattern
in a particular cluster is given against the corresponding converter life. Table 3.2.8
gives a crisp clustering: each pattern is awarded to the cluster to which it belongs
with maximum membership. In this set we see that the sample having a converter life of 662
runs belongs to cluster 5 with a membership of 0.365 and to cluster 1 with 0.2457, and it is
awarded to cluster 5. Similarly, the sample having a converter life of 563 runs belongs to
cluster 4 with a membership of 0.5058 and to cluster 5 with 0.2196. Therefore we see that
it is really difficult to have a clear-cut partition surface in this case.

(iii) for RD data 

The results of the FCM clustering method for RD data are presented in tables 3.2.3,
3.2.6 and 3.2.9. In table 3.2.3 the number of clusters c is kept at 4. The membership of a
particular pattern in a particular cluster is given against the corresponding converter life. In
table 3.2.6 a similar result is presented for c = 5. In table 3.2.9 each pattern has been
awarded to the cluster to which it belongs with maximum membership. From table
3.2.6, it is seen that the sample having a life of 563 runs belongs to cluster 5 with a
membership value of only 0.4185 and to cluster 4 with 0.2601. Similarly, 532 and 607 have a
maximum membership much less than 0.5. Therefore here also we can say that a clear-cut
partition surface cannot be created for the available data set. In fact, none of the three
data sets is accurate; they are good examples of fuzzy sets.

3.3 Kohonen Self-Organizing Network

The term self-organization refers to the ability to learn and organize information
without being given correct answers for input patterns. Thus, self-organizing networks
perform unsupervised learning.

The Kohonen network consists of a single layer of nodes (plus an input layer).
Each node receives input from the input layer and from the other nodes within the layer.
When we build a Kohonen network, it is important to initialize the weight vectors of the
nodes properly. It is also advisable to normalise both input vectors and weight vectors to a
constant (typically unit) length. Each node computes its activation by taking the dot product
of its weight vector and the input vector. The result reflects their similarity (or distance).
Symbolically,

$$O_j = X \cdot W_j$$

where $O_j$ is the activation level of unit j, X is the input vector and $W_j$ is the weight
vector of unit j. Suppose we place all weight vectors in a matrix W, and let the
vector O represent the activations of all nodes. Then we obtain

$$O = XW$$
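
A small sketch of the activation computation may help. It assumes that the weight matrix is stored as a list of per-node weight vectors and that inputs and weights are normalised to unit length first; the helper names `normalise`, `activations` and `winner` are illustrative.

```python
import math
from typing import List


def normalise(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit Euclidean length."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]


def activations(x: List[float], weights: List[List[float]]) -> List[float]:
    """Activation of every node: dot product of its weight vector with x."""
    return [sum(a * b for a, b in zip(x, w)) for w in weights]


def winner(x: List[float], weights: List[List[float]]) -> int:
    """Index of the node with the largest activation."""
    o = activations(x, weights)
    return max(range(len(o)), key=o.__getitem__)
```

With unit-length vectors the largest dot product corresponds to the smallest angle, and hence to the smallest Euclidean distance, between the input and the weight vector.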

The Kohonen network can perform various functions, namely clustering, learning,
statistical modelling and topology preservation. Our interest lies only in the clustering
provided by the Kohonen network, so we discuss only clustering.

Clustering 

Clustering is concerned with the grouping of objects (input patterns) according
to their similarity. We now discuss how the Kohonen network can perform clustering
through a competitive learning mechanism called "winner takes all".

In essence, the node with the largest activation level is declared the winner of the
competition. This node is the only node that will generate an output signal, and all other
nodes are suppressed to the zero activation level. Furthermore, this node and its neighbours
are the only nodes permitted to learn for the current input pattern.

The Kohonen network uses intralayer connections to
moderate this competition. The output of each node acts as an inhibitory input to other
nodes but is actually excitatory in its neighbourhood. Thus, even though there is only
one winner node, more than one node is allowed to change its weights. This
complex scheme for moderating competition within a layer is known as lateral
inhibition. The inhibitory effect of a node can also decrease with the distance from
it. The exact size of the neighbourhood varies as learning goes on: it starts large and is
slowly reduced, making the range of change sharper and sharper. To simulate lateral
inhibition, we simply take the winner. This avoids the complexity of truly implementing
this mechanism.

After training, the weight vector of each node encodes the information of a
group of similar input patterns. A given input vector is assigned to the node with
maximum activation. Since the number of nodes is fixed, the network algorithm is similar to
the k-means clustering algorithm. This kind of algorithm is more noise tolerant than
algorithms which do not specify the number of clusters in advance, such as ART
networks. Results, however, may depend on the presentation order of the input data for a
small amount of training data.

Learning 

The winning node and its neighbours learn by adjusting their weight vectors
according to the following rule:

$$W_{new} = W_{old} + \eta \, (X - W_{old})$$

where X is the input vector and $\eta$ is the learning rate.

Since the winner's weight vector generates the largest dot product with the input
vector, the winning weight vector is the one closest to the input vector. Kohonen
learning makes the winning weight vector even more similar to the input vector. As learning
proceeds, the size of the neighbourhood is gradually decreased. Fewer and fewer nodes can
learn in each iteration, and finally only the winning node learns.




[Table 3.3.2: Results of Kohonen SOM for ICA data (no. of output neurons = 5)]

The Kohonen network uses single-pass learning rather than multi-pass feedback and
is potentially fast. This suggests its suitability for real-time applications.

Algorithm

• First, a winning neuron is selected as the one with the shortest Euclidean distance

$$\| x - w_{i^*} \| = \min_{i} \| x - w_i \|$$

between its weight vector and the input vector, where $w_i$ denotes the weight vector
corresponding to the i-th output neuron.

• Let i* denote the index of the winner and let I* denote a set of indices corresponding
to a defined neighbourhood of winner i*. Then the weights associated with the
winner and its neighbouring neurons are updated by

$$\Delta w_j = \eta \, \Lambda(j, i^*) \, (x - w_j) \quad \text{for all } j \in I^*$$

where the neighbourhood function $\Lambda(j, i^*)$ may be chosen as, for example, a Gaussian,

$$\Lambda(j, i^*) = \exp\left( -\frac{\| r_j - r_{i^*} \|^2}{2\sigma^2} \right)$$

where $r_j$ represents the position of neuron j in the output space. The convergence of the
feature map depends on a proper choice of $\eta$. One plausible choice is $\eta = 1/t$. The
size of the neighbourhood (or $\sigma$) should decrease gradually.

• The weight update should be immediately followed by the normalisation of $w_j$.
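
One training step of this procedure can be sketched as follows. The sketch makes two simplifying assumptions that are not taken from the text: the output neurons are arranged in a one-dimensional chain, and the neighbourhood function is rectangular, equal to 1 for neurons within `radius` positions of the winner and 0 otherwise. The function name `som_step` and its parameters are illustrative.

```python
import math
from typing import List


def som_step(x: List[float], weights: List[List[float]],
             eta: float, radius: int) -> int:
    """Present one input vector x; update weights in place; return the winner index."""
    # Winner: neuron whose weight vector has the shortest Euclidean distance to x.
    dists = [math.dist(x, w) for w in weights]
    win = min(range(len(weights)), key=dists.__getitem__)
    for j, w in enumerate(weights):
        # Rectangular neighbourhood on a 1-D chain of neurons (assumption).
        if abs(j - win) <= radius:
            moved = [wi + eta * (xi - wi) for xi, wi in zip(x, w)]
            # Normalise immediately after the update, as the algorithm requires.
            length = math.sqrt(sum(a * a for a in moved))
            weights[j] = [a / length for a in moved] if length > 0 else moved
    return win
```

Calling `som_step` repeatedly over the data while reducing `radius` and `eta` reproduces the shrinking-neighbourhood behaviour described above; with `radius = 0` only the winner learns.
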

3.3.1 Results of Kohonen SOM

Results of the Kohonen SOM are presented in tables 3.3.1 to 3.3.6. Here we varied
the number of output neurons (i.e. the number of clusters we want) and the neighbourhood size.
The neighbourhood size plays a major role in classification. The learning rate alpha is 0.2
for all cases.

(i) for ICA data 

The results for ICA data are presented in tables 3.3.1 and 3.3.2. In table 3.3.1 the
neighbourhood size is kept fixed at 1. In table 3.3.2, we have kept the number of output neurons
fixed at 5 and varied the neighbourhood size from 0 to 4. With reference to table 3.3.1,
for 2 output neurons the samples having lives 772, 560, 662, 607 and 546 belong to one
cluster and the rest belong to another. For 3 output neurons, the data samples having lives
724, 772, 746 and 662 belong to the same cluster. Another cluster is formed by the samples
having lives 563, 499, 937, 595, 532, 712 and 546. The third cluster contains the samples for
which the lives are 560, 539, 607 and 652. For the case of 4 output neurons, the winner neuron
for lives 772 and 560 is neuron no. 1. Neuron no. 2 wins for the samples having lives 563, 499,
937, 607 and 546. The samples having lives 595, 539, 532, 712 and 652 form a cluster, and another
cluster is formed by the samples for which the lives are 662, 724 and 746. For the case of 5
output neurons a slight modification takes place in the partition: 546 now belongs to the group
having lives 595, 539, 532 and 607, while 712 and 652 leave this group. All these results
are given in table 3.3.1. In table 3.3.2, when we vary the neighbourhood size, we see that
the partition line also changes even for the same number of output neurons. It may also happen
that a particular neuron does not win for even a single instance. Therefore it is
necessary to choose a suitable neighbourhood size to obtain a true partition.

[Table 3.3.4: Results of Kohonen SOM for PCA data (no. of output neurons = 6)]

(ii) for PCA data 

The results of the Kohonen SOM for PCA data are given in tables 3.3.3 and 3.3.4. In
table 3.3.3 the neighbourhood size is kept fixed at 1 and the number of output neurons is varied.
In table 3.3.4 the neighbourhood size is varied from 0 to 4 for 5 output neurons. As observed
from table 3.3.3, for 2 output neurons (i.e. when we want 2 clusters) the patterns for
which the converter lives are 772, 560, 539, 607 and 546 runs form one group and the rest form
another. When the number of output neurons is increased to 3, the partition surface
changes. Patterns having converter lives of 563, 499, 937, 595, 532, 607 and 546 runs
belong to one group. Another group contains lives 772, 662, 724 and 746. The rest of the
patterns form the third group. This classification may vary if we change the neighbourhood
size. As seen from table 3.3.4, when the neighbourhood size is 0 neurons the partition is
such that converter lives 772 and 560 are in one group. Another group contains 563, 499, 937 and
595 runs. As we change the initial neighbourhood size to 1 neuron, 937 enters the
group having lives 772 and 560. Similarly, for a neighbourhood size of 2 neurons 712 also
enters this group while 652 leaves it.

(iii) for RD data 




[Table 3.3.6: Results of Kohonen SOM for RD data (no. of output neurons = 5)]

The results for RD data are presented in tables 3.3.5 and 3.3.6. In table 3.3.5 the
neighbourhood size is kept fixed at 3 and the number of output neurons is varied: in one case it
is 4 while in the other it is 5. In table 3.3.6 we have varied the neighbourhood size,
keeping the number of output neurons fixed at 5. The clusters are almost the same in both
cases in table 3.3.5 except for two differences: 595 forms a separate cluster in the case with
5 neurons, and 539 and 607 leave their group and are merged into the group containing the
samples having lives of 499 and 937 runs. As we vary the neighbourhood size (from 0
neurons to 4 neurons) in the case of 5 output neurons, we observe that the first three cases
give the same result. Changes occur only in the last two cases. So we can say here that a good
partition takes place when the neighbourhood size is much smaller than the number of output
neurons.

3.4 ART2 (Continuous-valued ART)

ART2 is a widely used clustering technique for analog or continuous-valued
patterns. The patterns are classified or clustered with the accuracy defined by a factor
called the "vigilance factor". The smaller the vigilance threshold, the larger the number of
clusters generated. The ability of this network to generalise is limited; however, its ability to
create a new pattern class in its knowledge base on the arrival of a novel pattern makes it
very suitable for clustering (i.e. these networks are unsupervised). The classification is
dependent on the order of presentation of the input patterns.

Let x and $w_j$ denote the input vector and the weight of neuron j respectively. The
criterion for selecting the winner is based on a minimum distance measure (e.g. Euclidean
or other distance).

[Table 3.4.2: Results of ART2 network for PCA data (vigilance parameter varied from 2.5 to 5)]

Algorithm

• Given a new training pattern, a winner neuron is selected by the minimum distance
criterion. Suppose the winner neuron is j*; then

$$\| x - w_{j^*} \| \le \| x - w_j \| \quad \text{for all } j$$

where $\| \cdot \|$ denotes the distance metric (Euclidean or any other distance).

• Vigilance test: A neuron j* passes the vigilance test if and only if

$$\| x - w_{j^*} \| < \rho$$

where the vigilance value $\rho$ determines the radius of a cluster.

• If the winner fails the vigilance test, a new neuron unit k is created with weight

$$w_k = x$$

• If the winner passes the vigilance test, adjust the weight of the winner j* by

$$w_{j^*}^{new} = \frac{x + w_{j^*}^{old} \, \| \text{cluster}_{j^*}^{old} \|}{\| \text{cluster}_{j^*}^{old} \| + 1}$$

where $\| \text{cluster}_i \|$ denotes the number of members in cluster i.
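
The steps above can be sketched as a single pass over the data. This is a minimal illustration of the distance-based procedure as described here, assuming the Euclidean metric and that the winner's weight is updated as the running mean of its members; the names `art2_cluster`, `patterns` and `rho` are illustrative.

```python
import math
from typing import List, Tuple


def art2_cluster(patterns: List[List[float]],
                 rho: float) -> Tuple[List[int], List[List[float]]]:
    """Assign each pattern to a cluster; create a new cluster when vigilance fails."""
    centres: List[List[float]] = []   # weight vector of each neuron
    counts: List[int] = []            # number of members in each cluster
    labels: List[int] = []
    for x in patterns:
        j = -1
        if centres:
            # Winner: neuron with the minimum distance to the pattern.
            dists = [math.dist(x, w) for w in centres]
            j = min(range(len(centres)), key=dists.__getitem__)
            if dists[j] >= rho:       # vigilance test failed
                j = -1
        if j < 0:
            # Create a new neuron whose weight equals the pattern.
            centres.append(list(x))
            counts.append(1)
            labels.append(len(centres) - 1)
        else:
            # Move the winner to the mean of its old members plus the new one.
            n = counts[j]
            centres[j] = [(xi + wi * n) / (n + 1) for xi, wi in zip(x, centres[j])]
            counts[j] = n + 1
            labels.append(j)
    return labels, centres
```

Because the vigilance value acts as a cluster radius in this formulation, a larger `rho` yields fewer clusters, and the outcome depends on the order in which the patterns are presented.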


3.4.1 Results of ART2 

For all three data sets, the vigilance factor is varied from 2.5 to 5.0, and
all the variables are scaled between 0 and 1. The result for ICA data
is given in table 3.4.1; tables 3.4.2 and 3.4.3 present the results for PCA data and RD data.
The number of clusters decreases as the vigilance factor is increased.

(i) for ICA data 

For vigilance factor 2.5, there are 7 clusters. For vigilance up to 3.4,
there is no change in clustering. For vigilance 3.5, there are 6 clusters. The number of
clusters further reduces to 5 and 4 for vigilance factors 3.6 and 3.7 respectively. Then up
to vigilance 4.6 there are 4 clusters. For 4.7 and 4.8 there are 3 clusters, and for
vigilance factors 4.9 and 5.0 there are only two clusters. For all values of the
vigilance factor, the samples having lives 772, 560 and 662 belong to the same group. For a
small value of the vigilance factor, there are many groups that contain only one member. In
contrast to all the methods discussed previously, in ART2 clustering 499 and 937 belong to
different groups. According to this method, 499 belongs to an altogether different cluster:
even for a vigilance as high as 4.6 it forms a separate group. For vigilance 4.7 and above,
563, 499, 937 and 595 belong to the same cluster, as happened with both K-means and Fuzzy
C-means clustering. For all values of the vigilance factor, 652, 662, 724 and 746 belong to the
same group. Similarly, 539, 546 and 712 are always in the same group.

(ii) for PCA data

For vigilance factor 2.5, there are 9 clusters in this case. The classification
remains the same until the vigilance is 2.8. Here 772 and 560 belong to one cluster. 563, 499,
937, 595, 539 and 607 are the only members of their respective clusters. 532 and 712 form
one cluster and, similarly, 662, 724, 746 and 652 belong to another cluster. As the vigilance
is changed to 2.9, the classification changes and there are now 8 clusters.
This is the case until the vigilance is 3.1. 532 and 595, which were previously in separate
clusters, are now merged together. The rest of the classification is similar to the previous
case. For vigilance 3.2 to 3.9, there are 6 clusters. There must have been an intermediate
classification boundary between 3.1 and 3.2 that we did not observe. Here 607, which previously
formed a separate cluster, is now merged into the group containing 772 and 560. 563
and 937 are merged together. 712 is now in the 539, 595 group. For vigilance 4 and 4.1
there are 5 clusters. Cluster no. 1 contains the samples for which the converter lives are
772, 560, 662, 712, 724 and 652. Cluster no. 2 contains the data samples having lives 563
and 937. In cluster no. 3 only 499 is present. Cluster no. 4 contains the samples having
lives 595 and 532. The samples having lives 712, 539 and 546 belong to cluster no. 5. For
vigilance from 4.2 to 5.0 there are 4 clusters. The number of clusters could have been further
reduced if we had increased the vigilance factor.

(iii) for RD data 

Table 3.4.3 contains the results of ART2 applied to RD data for vigilance
factors 2.5 to 5.0 in steps of 0.1. The number of clusters varies from 10 to 2 over the given
values of the vigilance factor. For vigilance factors 2.5 and 2.6, there are 10 clusters. All
the data samples form separate clusters except the samples having lives 652, 662, 724 and
746, which form one cluster, and the samples having lives 539, 546, 532 and 712: 539 and 546
belong to one cluster while 532 and 712 belong to another. For vigilance factor 2.7 there
are 9 clusters. The two groups which contained converter lives 539, 546 and 532, 712 are
now merged. The rest of the classification remains the same. For vigilance factor 2.8, there
are 8 clusters. The samples having lives 772 and 560 now belong to the same cluster. For
vigilance 2.9, the data set is divided into 7 clusters. For vigilance 3 and 3.1 there are 6
clusters, whereas for 3.2 to 3.4 the data set contains 5 clusters. For vigilance factors 3.5 to
3.8 there are 4 groups. Although for vigilance 3.8 the number of clusters remains 4, the data
partition is different from the others. This is only because of the vigilance factor. For
vigilance 3.9 to 4.5, the data set is partitioned into 3 clusters. For the remaining vigilance
factors there are only two clusters.



3.5 FUZZY ART

Adaptive Resonance Theory (ART) networks are most useful for pattern
clustering, classification and recognition. Their ability to generalise is limited; however,
the ability of these networks to create a new pattern class in their knowledge base on the
arrival of a novel pattern makes them very useful. These networks can work on binary or
analog-valued patterns.

Fuzzy ART is a form of ART1 that incorporates fuzzy logic operations.
Although ART1 can learn to classify only binary input patterns, fuzzy ART can learn to
classify both analog and binary input patterns.

In fuzzy ART, input vectors are normalised at a preprocessing stage. This
normalisation procedure, called complement coding, leads to a symmetric theory in
which the MIN operator ($\wedge$) and the MAX operator ($\vee$) of fuzzy set theory play
complementary roles.

Algorithm

ART field activity vectors - A fuzzy ART system includes a field $F_0$ of nodes that
represent a current input vector; a field $F_1$ that receives both bottom-up input from $F_0$
and top-down input from a field $F_2$ that represents the active code, or category. The $F_0$
activity vector is denoted $I = (I_1, \ldots, I_M)$, with each component $I_i$ in the interval
[0, 1], i = 1, 2, ..., M. The $F_1$ activity vector is denoted $x = (x_1, x_2, \ldots, x_M)$ and
the $F_2$ activity vector is denoted $y = (y_1, y_2, \ldots, y_N)$.

Weight vector - Associated with each $F_2$ category node j (j = 1, 2, ..., N) is a vector
$w_j = (w_{j1}, \ldots, w_{jM})$. Initially,

$$w_{j1}(0) = w_{j2}(0) = \cdots = w_{jM}(0) = 1$$

Parameters - Fuzzy ART dynamics are determined by a choice parameter $\alpha > 0$, a
learning rate parameter $\beta \in [0, 1]$ and a vigilance parameter $\rho \in [0, 1]$.

Category choice - For each input I and $F_2$ node j, the choice function $T_j$ is defined by

$$T_j(I) = \frac{|I \wedge w_j|}{\alpha + |w_j|}$$

where the fuzzy AND operator $\wedge$ is defined by

$$(p \wedge q)_i = \min(p_i, q_i)$$

and where the norm $|\cdot|$ is defined by

$$|p| = \sum_{i=1}^{M} |p_i|$$

for any M-dimensional vectors p and q. For notational simplicity, $T_j(I)$ is often written
as $T_j$ when the input I is fixed.

The system is said to make a category choice when at most one $F_2$ node can
become active at a given time. The category choice is indexed by J, where

$$T_J = \max\{T_j : j = 1, 2, \ldots, N\}$$

If more than one $T_j$ is maximal, the category j with the smallest index is chosen.
In particular, nodes become committed in the order j = 1, 2, 3, .... When the J-th category is
chosen, $y_J = 1$ and $y_j = 0$ for $j \neq J$. In a choice system, the $F_1$ activity vector x
obeys the equation

$$x = \begin{cases} I & \text{if } F_2 \text{ is inactive} \\ I \wedge w_J & \text{if the } J\text{-th } F_2 \text{ node is chosen} \end{cases}$$

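Resonance or reset - Resonance occurs if the match function $|I \wedge w_J| / |I|$ of the chosen
category meets the vigilance criterion:

$$\frac{|I \wedge w_J|}{|I|} \ge \rho$$

i.e. when the J-th category is chosen, resonance occurs if

$$|x| = |I \wedge w_J| \ge \rho |I|$$

Mismatch reset occurs if

$$\frac{|I \wedge w_J|}{|I|} < \rho$$

i.e. if

$$|x| = |I \wedge w_J| < \rho |I|$$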

Then the value of the choice function $T_J$ is set to 0 for the duration of the input
presentation to prevent the persistent selection of the same category during search. A new
index J is then chosen. The search continues until the chosen J satisfies the resonance
condition.

[Table 3.5.2: Results of fuzzy ART network for ICA data (vigilance = 0.5, beta = 0.6, alpha varied)]

Learning - Once the search ends, the weight vector $w_J$ is updated according to the equation

$$w_J^{new} = \beta \left( I \wedge w_J^{old} \right) + (1 - \beta) \, w_J^{old}$$

Input normalisation / complement coding option - Proliferation of categories is avoided
in fuzzy ART if inputs are normalised. Complement coding is a normalisation rule that
preserves amplitude information. To define this operation in its simplest form, let a represent
the on-response; then the complement of a, denoted by $a^c$, represents the off-response, where

$$a_i^c = 1 - a_i$$

The complement-coded input I to the field $F_1$ is the 2M-dimensional vector

$$I = (a, a^c) = (a_1, \ldots, a_M, a_1^c, \ldots, a_M^c)$$

Note that

$$|I| = |(a, a^c)| = \sum_{i=1}^{M} a_i + \left( M - \sum_{i=1}^{M} a_i \right) = M$$

so inputs preprocessed into complement coding form are automatically normalised.
Where complement coding is used, the initial weight vectors are given by

$$w_{j1}(0) = \cdots = w_{j,2M}(0) = 1$$
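
The category choice, vigilance test, learning rule and complement coding fit together as in the following sketch. It is an illustration under stated assumptions, not the program used for the results below: one uncommitted node with all weights equal to 1 is always kept available, candidates are tried in decreasing order of the choice function (equivalent to resetting a failed category and choosing again), and the same learning rate is used when a node is first committed. The names `fuzzy_art`, `complement_code`, `fuzzy_and` and `norm1` are illustrative.

```python
from typing import List, Tuple


def complement_code(a: List[float]) -> List[float]:
    """Return the 2M-dimensional vector (a, 1 - a)."""
    return list(a) + [1.0 - v for v in a]


def fuzzy_and(p: List[float], q: List[float]) -> List[float]:
    """Component-wise minimum (fuzzy AND)."""
    return [min(a, b) for a, b in zip(p, q)]


def norm1(p: List[float]) -> float:
    """Sum of absolute values of the components."""
    return sum(abs(v) for v in p)


def fuzzy_art(patterns: List[List[float]], rho: float, alpha: float = 0.1,
              beta: float = 0.6) -> Tuple[List[int], List[List[float]]]:
    """Cluster patterns with components in [0, 1]; returns (labels, weights)."""
    dim = 2 * len(patterns[0])
    weights = [[1.0] * dim]                 # single uncommitted node
    labels: List[int] = []
    for a in patterns:
        I = complement_code(a)

        def choice(j: int) -> float:
            w = weights[j]
            return norm1(fuzzy_and(I, w)) / (alpha + norm1(w))

        # Try categories by decreasing choice value, smallest index first on ties.
        order = sorted(range(len(weights)), key=lambda j: (-choice(j), j))
        J = order[-1]
        for j in order:
            if norm1(fuzzy_and(I, weights[j])) >= rho * norm1(I):  # resonance
                J = j
                break
        # Learning: w_new = beta * (I AND w_old) + (1 - beta) * w_old.
        w_old = weights[J]
        weights[J] = [beta * m + (1.0 - beta) * w
                      for m, w in zip(fuzzy_and(I, w_old), w_old)]
        if J == len(weights) - 1:           # the uncommitted node was used
            weights.append([1.0] * dim)
        labels.append(J)
    return labels, weights[:-1]             # drop the spare uncommitted node
```

The uncommitted node always satisfies the vigilance criterion because its match with any input equals the input itself, so the search is guaranteed to end.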

3.5.1 Results of Fuzzy ART network 

The results of the Fuzzy ART network are presented in tables 3.5.1 to 3.5.6. The
vigilance factor and alpha are varied over the range [0.1, 1]. For all three data sets alpha
has no effect on clustering. As expected, the vigilance factor plays the major role: the larger
the vigilance factor, the larger the number of clusters created.

(i) for ICA data 

Results of clustering obtained by the Fuzzy ART algorithm are presented in tables
3.5.1 and 3.5.2. In table 3.5.1 we have varied the vigilance factor from 0.1 to 1, and in table
3.5.2 alpha is varied over the same range. For vigilance values 0.1 and 0.2 we have two
clusters. All the samples except the one having converter life 652 belong to a
single cluster. Although for vigilance factors 0.3 and 0.4 we have 3 clusters, the line of
partition differs between the two cases. A similar effect is seen for vigilance values 0.5 and
0.6. As we move from vigilance 0.6 to 0.7, the number of clusters increases from 4 to 7.
The number of clusters is 12, 14 and 15 for vigilance factors 0.8, 0.9 and 1.0
respectively. For vigilance value 1.0, in all cases the number of clusters created will
be the same as the number of data samples available.

[Table 3.5.4: Results of fuzzy ART network for PCA data (vigilance = 0.6, beta = 0.6, alpha varied)]

[Table 3.5.6: Results of fuzzy ART network for RD data (vigilance = 0.6, beta = 0.6, alpha varied)]

(ii) for PCA data

The results of the Fuzzy ART network for PCA data are presented in tables 3.5.3
and 3.5.4. Initially, for vigilance values 0.1 and 0.2, the set is divided into 3 clusters, and
then the number of clusters gradually increases with increasing vigilance factor.

(iii) for RD data 

The results of the Fuzzy ART network for RD data are presented in tables 3.5.5
and 3.5.6. For vigilance factor 0.1, we have two clusters. For vigilance values 0.2, 0.3
and 0.4, the data set is divided into 3 clusters. The number of clusters is 5 for vigilance 0.5
and 6 for 0.6. For vigilance 0.8 we have 11 clusters, and for 0.9 and 1.0 the
number of clusters is 15. In most cases we see that in this clustering the last data
sample forms a separate cluster. This may be because of the order of presentation of the
input patterns, which matters a great deal in this case. As in
the case of ICA and PCA data, alpha has no effect on clustering.

3.6 ID3 Algorithm

ID3 is a machine learning algorithm which generates decision rules from a set of
training examples. Each example is represented by a list of features.

ID3 uses a tree representation for concepts. To classify a set of examples, we
start at the top of the tree and answer the questions associated with the nodes in the tree
until we reach a leaf node, where the classification or decision is stored.

ID3 uses information theory to select the features which give the greatest
information gain or decrease of entropy. Entropy is defined as $-p \log_2 p$, where p is
determined from the frequency of occurrence.

The ID3 algorithm is stated as follows:

• Let N be the total number of learning examples, and $N_i$ the number of examples that
belong to class i, i = 1, 2, 3, ..., C, where C is the total number of classes. The information
entropy for the problem consisting of N examples is

$$\text{entropy}(I) = \sum_{i=1}^{C} -\frac{N_i}{N} \log_2 \left( \frac{N_i}{N} \right)$$

• When a feature test is performed on feature $A_k$, k = 1, 2, ..., K, all examples are
divided into J subsets, where feature $A_k$ has J values. Assume that there are
$n_{kj}(i)$ examples in subset j belonging to class i, and that the total number of examples in
subset j is $n_{kj} = \sum_{i} n_{kj}(i)$. Then the entropy of the feature is calculated as

$$\text{entropy}(I, A_k) = \sum_{j=1}^{J} \frac{n_{kj}}{N} \sum_{i=1}^{C} -\frac{n_{kj}(i)}{n_{kj}} \log_2 \left( \frac{n_{kj}(i)}{n_{kj}} \right)$$

• The new level of the decision tree is built by adding the leaf nodes resulting from
testing feature $A_{k'}$, where the test on feature $A_{k'}$ results in the maximum
information gain:

$$\max_{k} \left[ \text{entropy}(I) - \text{entropy}(I, A_k) \right]$$

• Testing different features and growing the decision tree should continue until all leaf
nodes contain examples of a single class only. The corresponding entropy is then

$$\text{entropy}(I, A_{k_1}, A_{k_2}, \ldots, A_{k_L}) = 0$$

where the decision tree consists of L (L ≤ K) levels and the corresponding feature test
at level l is $A_{k_l}$.

Finally, we obtain a decision tree, which can be described in terms of
hierarchical decision rules. The examples belonging to the same leaf node belong to the same
cluster.
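
The entropy and information-gain calculations at the core of the algorithm can be sketched as follows. The sketch assumes that each example is a tuple of discrete feature values (for instance range codes) with a separate list of class labels; the function names `entropy`, `feature_entropy` and `best_feature` are illustrative, and only the choice of the feature for one tree level is shown.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Information entropy of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def feature_entropy(rows: Sequence[Sequence[Hashable]],
                    labels: Sequence[Hashable], k: int) -> float:
    """Weighted entropy after splitting the examples on feature k."""
    n = len(labels)
    subsets: Dict[Hashable, List[Hashable]] = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[k], []).append(label)
    return sum(len(s) / n * entropy(s) for s in subsets.values())


def best_feature(rows: Sequence[Sequence[Hashable]],
                 labels: Sequence[Hashable]) -> int:
    """Index of the feature whose test gives the maximum information gain."""
    base = entropy(labels)
    gains = [base - feature_entropy(rows, labels, k) for k in range(len(rows[0]))]
    return max(range(len(gains)), key=gains.__getitem__)
```

Applying `best_feature` recursively to each subset, until every subset contains a single class, grows the full decision tree.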

3.6.1 Results of ID3 Algorithm

[Table 3.6.1: Results of ID3 algorithm for ICA data]












Since the core part of this algorithm is the entropy calculation, we are required to
convert the input to a suitable form. Therefore each variable in every data set is
divided into three ranges depending upon its variation. The converter life is
divided into two classes: (i) when the life is below 650 runs and (ii) when the life is more than
650 runs. The input to the ID3 algorithm is given in table A5 for ICA data, in table A6
for PCA data and in table A7 for RD data.

(i) for ICA data 

The result of the ID3 algorithm is presented in table 3.6.1. Here classification
takes three steps. The following decision rules are formulated.

[1] If mean blow oxygen is in range 6 then the corresponding class of converter life is 50.

[2] If mean blow oxygen is in range 4 and mean MnO is in range 38 then the corresponding
class of converter life is 70.

[3] If mean blow oxygen is in range 4 and mean MnO is in range 37 and mean basicity
is in range 26 then the corresponding class of converter life is 50.

[4] If mean blow oxygen is in range 4 and mean MnO is in range 37 and mean basicity
is in range 25 then the corresponding class of converter life is 70.

[5] If mean blow oxygen is in range 5 and mean MnO is in range 37 and mean basicity
is in range 26 then the corresponding class of converter life is 50.

[6] If mean blow oxygen is in range 5 and mean MnO is in range 37 and mean basicity
is in range 25 then the corresponding class of converter life is 70.

The ranges for these variables are given in table A5.
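
The six rules translate directly into a small lookup function. The sketch below assumes that each variable is passed as its range code (as listed in table A5) and that the class is returned as the code 50 or 70 used in the rules; the function name `ica_life_class` is illustrative. Combinations not covered by any rule return `None`.

```python
from typing import Optional


def ica_life_class(blow_oxygen: int, mno: int, basicity: int) -> Optional[int]:
    """Converter-life class (50 or 70) from range codes, following rules [1] to [6]."""
    if blow_oxygen == 6:                      # rule [1]
        return 50
    if blow_oxygen == 4 and mno == 38:        # rule [2]
        return 70
    if blow_oxygen in (4, 5) and mno == 37:
        if basicity == 26:                    # rules [3] and [5]
            return 50
        if basicity == 25:                    # rules [4] and [6]
            return 70
    return None                               # not covered by the listed rules
```

For example, `ica_life_class(4, 37, 26)` follows rule [3] and returns 50, while a combination such as blow oxygen in range 5 with MnO in range 38 is not covered by the rules as stated.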

(ii) for PCA data 

The result of IDS algorithm for PCA data is presented in the table 3.6.2. Input to 
the algorithm is given in table A6 with corresponding ranges. 

Decision takes place in 3 steps and following decision rules are formulated. 

[1] If mean Si in hot metal is in range 4, then the corresponding class for converter 
life is 50. 

[2] If mean Si in hot metal is in range 6, then the corresponding class for converter 
life is 50. 

Table 3.6.2: Results of ID3 algorithm for PCA data, with sorted features (table values: 0.4327, 0.4437, 0.6667, 0.691)


[3] If mean Si in hot metal is in range 5 and mean blow oxygen is in range 12, then 
the corresponding class for converter life is 50. 

[4] If mean Si in hot metal is in range 5 and mean blow oxygen is in range 10 and mean 
basicity is in range 35, then the corresponding class for converter life is 50. 

[5] If mean Si in hot metal is in range 5 and mean blow oxygen is in range 11 and 
mean basicity is in range 35, then the corresponding class for converter life is 70. 

[6] If mean Si in hot metal is in range 5 and mean blow oxygen is in range 11 and mean 
basicity is in range 34, then the corresponding class for converter life is 50. 

(iii) for RD data 

The result of the ID3 algorithm for RD data is presented in table 3.6.3. The input 
to the algorithm is given in table A7. Classification takes place in three steps and 
the following decision rules are formulated: 

[1] If FeO in slag is in range 36, then the corresponding class for converter life is 50. 

[2] If FeO in slag is in range 34, then the corresponding class for converter life is 70. 

[3] If FeO in slag is in range 35 and mean Si in hot metal is in range 4, then the 
corresponding class for converter life is 70. 

[4] If FeO in slag is in range 35 and mean Si is in range 5, then the corresponding 
class for converter life is 50. 

[5] If FeO in slag is in range 35 and mean Si is in range 6, then the corresponding 
class for converter life is 50. 
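
Because samples that reach the same leaf belong to the same cluster, the rules for RD data also define a partition of the samples. The following sketch shows that idea under assumptions: the function names and the five sample range codes are hypothetical (table A7 holds the real inputs), and samples not covered by any rule are simply left out.

```python
from collections import defaultdict

def rd_leaf(feo_range, si_range):
    """Return (leaf id, life class) for RD-data rules [1] to [5], else None."""
    if feo_range == 36:
        return ("FeO=36", 50)                       # rule [1]
    if feo_range == 34:
        return ("FeO=34", 70)                       # rule [2]
    if feo_range == 35 and si_range in (4, 5, 6):   # rules [3], [4], [5]
        return (f"FeO=35,Si={si_range}", 70 if si_range == 4 else 50)
    return None

def cluster_by_leaf(samples):
    """Group sample indices by the leaf they reach; one cluster per leaf."""
    clusters = defaultdict(list)
    for index, (feo_range, si_range) in enumerate(samples):
        leaf = rd_leaf(feo_range, si_range)
        if leaf is not None:
            clusters[leaf[0]].append(index)
    return dict(clusters)

# Hypothetical range codes (FeO in slag, mean Si) for five samples.
samples = [(36, 5), (35, 4), (35, 4), (34, 6), (35, 6)]
clusters = cluster_by_leaf(samples)
```

The number of clusters here is whatever number of leaves the samples happen to occupy, so it is a consequence of the tree rather than a value chosen in advance.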



CHAPTER 4 


RESULTS AND DISCUSSION 


In this chapter a comparative study of all the methods is presented. For comparison, 
the modelling techniques developed in [7] and [13] are used: the data sets obtained 
with the aforesaid techniques were given as input to the models developed in [7] and 
[13]. 

First, a comparison is drawn between the data prioritisation techniques. The 
variables are selected using PCA, ICA, and the experience of R&D personnel of RDCIS, 
SAIL. The main idea behind using modelling as the ground for comparison is that, if the 
selected variables capture the variation in the data, the developed model will predict 
the converter life with good accuracy and can be used in future. If a good model cannot 
be obtained, it may be deduced that the total sample variation is not captured, so 
either a different set of variables must be used or the number of selected variables 
must be increased. Modelling is done using simple regression and auto regressive 
moving average (ARMA) in [13] and the group method of data handling (GMDH) in [7]. 
Models are also developed using clusterwise regression in [13], in which all these 
clustering methods have been employed. Results are presented in terms of error 
statistics in tables 4.1 to 4.10. 

4.1 Comparison of Data Prioritisation techniques 

From tables 4.1, 4.2 and 4.3 it is observed that the best training in most cases 
takes place for ICA data. In GMDH with ICA data, the mean training error is 1.5546 
and the mean prediction error is 6.5562; the training and prediction slopes are 1.0046 
and 1.0657 respectively. For PCA data these quantities are 4.7543, 6.198, 0.9981 and 
0.9973. For R&D data the mean training error is 5.9776 and the mean prediction error 
is 5.1668; the training slope is 1.0052 and the prediction slope is 0.9483. Similarly, 
in regression the best training and prediction statistics are obtained for ICA data, 
followed by R&D data and then PCA data. In the ARMA model also, ICA data gives the 
best result, and PCA data gives far better results than R&D data. Therefore, we can 
say that the variables selected from ICA capture most of the variation present in the 
samples. PCA and R&D experience also capture the variation, but not to the extent 
that ICA does. 

4.2 Comparison between various clustering methods 

All the clustering methods are compared using clusterwise regression. The 
results are presented in tables 4.4 to 4.10. Clusterwise regression has been 
performed for 3 and 4 clusters using polynomials. All three data sets are used in 
modelling. In most cases K-means clustering gives the best result for both 3 and 4 
clusters. Fuzzy C-means clustering also gives good results, followed by the Kohonen 
SOM and the ART2 network. For the Fuzzy ART network, although the training statistics 
are comparable to those of the other techniques, the prediction is very poor. This 
means that for the Fuzzy ART network the modelled system is unable to capture the 
trend, that is, the data partition is not good enough in this case. The other reason 
for the poor results of the Fuzzy ART network is that the sample used for prediction 
always forms a separate cluster, so the model cannot fit a curve for this cluster. The 
table for ID3 is provided separately because here we have no control over the number 
of clusters created. However, training is very good here for ICA and PCA data. 
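
The effect of a single-sample cluster on clusterwise regression can be illustrated with a small sketch. It is a deliberate simplification and not the model of [13]: one input feature and a first-degree polynomial per cluster are assumed, the cluster labels are taken as given, and the data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns None if the fit is undetermined."""
    n = len(xs)
    if n < 2:
        return None                      # one point cannot fix a slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None                      # all x equal: slope undefined
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return (mean_y - b * mean_x, b)      # (intercept a, slope b)

def clusterwise_fit(xs, ys, labels):
    """Fit one line per cluster label."""
    models = {}
    for label in sorted(set(labels)):
        cx = [x for x, l in zip(xs, labels) if l == label]
        cy = [y for y, l in zip(ys, labels) if l == label]
        models[label] = fit_line(cx, cy)
    return models

# Hypothetical data; the last sample sits alone in cluster 2.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 30.0]
ys = [2.0, 4.0, 6.0, 5.0, 6.0, 7.0, 9.0]
labels = [0, 0, 0, 1, 1, 1, 2]
models = clusterwise_fit(xs, ys, labels)
```

Clusters 0 and 1 each get a fitted line, while cluster 2 has no model at all, which mirrors the situation described for the Fuzzy ART partition when the prediction sample forms its own cluster.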

4.3 Conclusion 

• Independent component analysis gives the best results for all modelling 
schemes; therefore we can say that independent components better capture 
the variation present in the samples. 

• For PCA data, training is always good but prediction is not up to the 
mark. This does not mean that the variables selected through PCA fail to 
represent the variation, only that they do so to a lesser extent than ICA. 

• Among the clustering techniques, K-means and fuzzy C-means clustering give 
better results; therefore we can say that the data partition is better when 
the number of clusters required is specified in advance. 

• For Fuzzy ART, prediction is worst because the last sample is placed in a 
separate cluster in all the cases considered here (for symmetry in results). 

mn error = mean error, ms error = mean square error, rms error = root mean square error, error std = standard 
deviation of error, max error = maximum error, min error = minimum error 
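
The quantities named in this legend could be computed along the following lines. This is a sketch under assumptions, since the exact definitions used in tables 4.1 to 4.10 are not stated here: the error is taken as the absolute difference between actual and predicted converter life, the standard deviation is the population form (division by n), and the function name and sample numbers are hypothetical.

```python
import math

def error_statistics(actual, predicted):
    """Summary statistics of the error, keyed by the abbreviations in the legend."""
    # Assumption: error = absolute difference between actual and predicted value.
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(errors)
    mn = sum(errors) / n
    ms = sum(e * e for e in errors) / n
    return {
        "mn error": mn,
        "ms error": ms,
        "rms error": math.sqrt(ms),
        "error std": math.sqrt(sum((e - mn) ** 2 for e in errors) / n),
        "max error": max(errors),
        "min error": min(errors),
    }

# Hypothetical actual and predicted converter lives (in runs).
stats = error_statistics([600.0, 700.0, 650.0], [610.0, 690.0, 650.0])
```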


4.4 Recommendation for Future work 

Many data prioritisation and dimensionality reduction methods are available. 
Here, PCA and ICA have been used for feature prioritisation, but these techniques can 
also be used to transform the feature space. Modelling should therefore also be done 
using the transformed feature space, which may give better results. The R&D 
experience can be validated using Saaty's method of feature prioritisation, which has 
not been used here and should be tried. Apart from this, many other clustering 
techniques are available and can also be used. 
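
The difference between prioritising features and transforming the feature space can be sketched for PCA as follows. The sketch rests on assumptions: ranking variables by their absolute loading on the first principal component is only one possible prioritisation rule and may differ from the procedure used in this work, the leading component is found by plain power iteration, and the data and function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix and centred data (one row per sample)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centred

def first_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration (unit length)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: three variables, the third nearly constant.
rows = [[1.0, 2.1, 5.0], [2.0, 3.9, 5.1], [3.0, 6.2, 4.9], [4.0, 7.8, 5.0]]
cov, centred = covariance_matrix(rows)
pc1 = first_component(cov)

# Prioritisation: keep the original variables, ranked by |loading| on PC1.
ranking = sorted(range(len(pc1)), key=lambda j: -abs(pc1[j]))

# Transformation: replace each sample by its score along PC1.
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centred]
```

With prioritisation the model still sees a subset of the measured variables, whereas with transformation it sees new variables (the scores) that are linear combinations of all of them.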